Skip to content

Comments

Extracts truncated UTF-8 sequence validation to BitString class#718

Draft
mward-sudo wants to merge 5 commits intobartblast:devfrom
mward-sudo:02-18-extracts_truncated_utf-8_sequence_validation_to_bitstring_class
Draft

Extracts truncated UTF-8 sequence validation to BitString class#718
mward-sudo wants to merge 5 commits intobartblast:devfrom
mward-sudo:02-18-extracts_truncated_utf-8_sequence_validation_to_bitstring_class

Conversation

@mward-sudo
Copy link
Contributor

@mward-sudo mward-sudo commented Feb 18, 2026

Closes #714

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

  • New Features

    • Added robust UTF-8 validation and decoding utilities to improve text handling and truncation detection.
  • Refactor

    • Simplified UTF-8 validation flow by centralizing sequence checks and reducing internal helper complexity.
  • Tests

    • Added extensive UTF-8 test coverage covering 1–4 byte sequences, truncation cases, overlong/surrogate/out-of-range cases, and continuation-byte validation.

@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds UTF‑8 decoding, validation, and truncation detection utilities to Bitstring and refactors the Erlang unicode module to use those utilities; duplicates of the new utilities appear in bitstring.mjs and tests for the utilities are added.

Changes

Cohort / File(s) Summary
Bitstring UTF‑8 utilities
assets/js/bitstring.mjs
Added five static methods: decodeUtf8CodePoint(bytes,start,length), isValidUtf8CodePoint(codePoint,encodingLength), isValidUtf8ContinuationByte(byte), isValidUtf8Sequence(bytes,start,length), and isTruncatedUtf8Sequence(bytes,start). Note: utilities are present in two locations within the file (duplication).
Unicode module refactor
assets/js/erlang/unicode.mjs
Removed local UTF‑8 decoding/validation helpers; findValidUtf8Length now returns a numeric length. Call sites updated to use Bitstring.isValidUtf8Sequence and Bitstring.isTruncatedUtf8Sequence for validation and truncation checks.
Tests
test/javascript/bitstring_test.mjs
Added extensive tests covering decoding, continuation byte checks, code point validation, sequence validation, and truncation detection across 1–4 byte sequences and edge cases (overlong, surrogates, out‑of‑range, truncated sequences).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ❓ Inconclusive The PR contains scope-related changes: UTF-8 utilities appear duplicated in bitstring.mjs, and tests duplicate validation coverage. While the core functionality aligns with #714, the duplications suggest incomplete refactoring. Verify that duplicated UTF-8 methods and test cases in bitstring.mjs and bitstring_test.mjs are intentional, or consolidate them to maintain a single source of truth for UTF-8 validation logic.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: extracting UTF-8 truncated sequence validation logic to the Bitstring class, which is the primary objective of this PR.
Linked Issues check ✅ Passed The PR successfully implements the main objectives from #714: introduces BitString.isTruncatedUtf8Sequence(), adds supporting UTF-8 helper methods, rationalizes findValidUtf8Length(), and updates call sites to use the new BitString methods.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
assets/js/bitstring.mjs (1)

258-259: Consider hoisting constant lookup objects out of the method bodies.

firstByteMasks (Line 259) and minValueForLength (Line 602) are re-created on every call. If these methods are called in a tight loop (e.g., scanning long byte arrays), the repeated object allocation is unnecessary overhead. Hoisting them to module-level or static class-level constants eliminates this.

♻️ Suggested change
+// Module-level constants for UTF-8 decoding
+const UTF8_FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07};
+const UTF8_MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
+
 export default class Bitstring {

Then use them in the respective methods:

   static decodeUtf8CodePoint(bytes, start, length) {
     if (length === 1) return bytes[start];
-    const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-    let codePoint = bytes[start] & firstByteMasks[length];
+    let codePoint = bytes[start] & UTF8_FIRST_BYTE_MASKS[length];
   static isValidUtf8CodePoint(codePoint, encodingLength) {
-    const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint < UTF8_MIN_VALUE_FOR_LENGTH[encodingLength]) return false;

Also applies to: 601-602

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 258 - 259, The objects firstByteMasks
and minValueForLength are being re-created inside function bodies on every call
causing unnecessary allocations; move them to module-level constants (e.g.,
define const FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const
MIN_VALUE_FOR_LENGTH = {...}) and replace the in-function usages of
firstByteMasks and minValueForLength with these hoisted constants in the
functions that reference them (search for identifiers firstByteMasks and
minValueForLength to update all occurrences).
assets/js/erlang/unicode.mjs (1)

670-672: Nit: inconsistent comment casing.

Line 670 uses lowercase "scan forward" while the other four copies (Lines 98, 309, 563, 776) use "Scan forward". Trivial, but one more reason to extract the function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 670 - 672, The comment "scan
forward, validating each sequence" is inconsistently cased compared to other
occurrences; update this comment to "Scan forward, validating each sequence" to
match the other instances (or better, extract the repeated logic into a helper
function used by the blocks that declare let pos = 0 to remove duplicated
comments). Locate the block containing the variable pos (the scanning loop) and
either change the comment text to title case or refactor the scan logic into a
single function (e.g., scanForward/validateSequence) and replace the duplicated
comments with a single well-named function call.
test/javascript/bitstring_test.mjs (1)

5267-5344: Clarify and de‑duplicate truncated‑sequence tests.
Two 4‑byte truncation tests use identical input, and the first test’s comment mentions a continuation byte that isn’t present. Consider removing the duplicate and aligning the comment or sample bytes for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/javascript/bitstring_test.mjs` around lines 5267 - 5344, The two tests
asserting truncation for 4-byte sequences duplicate the same input ([0xf0, 0x90,
0x8d]) and the first test's comment is inconsistent; remove the duplicate test
(either the one titled "returns true for truncated 4-byte sequence with valid
continuation bytes" or "returns true for 4-byte sequence with 2 valid
continuation bytes (truncated)") and keep a single test that uses
Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d], updating its
comment to state "only 3 bytes available (2 valid continuation bytes)" so the
test name, comment and sample bytes are aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The helper findValidUtf8Length is duplicated in five
places; move it into the Bitstring utility and call it from unicode.mjs instead
of inlining; specifically, add a method (e.g.,
Bitstring.findValidUtf8Length(bytes)) to bitstring.mjs that performs the same
loop using Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence,
export/attach it to the Bitstring class, then in unicode.mjs replace each
in-file findValidUtf8Length definition with direct calls to
Bitstring.findValidUtf8Length(bytes) from the functions characters_to_list/1,
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 258-259: The objects firstByteMasks and minValueForLength are
being re-created inside function bodies on every call causing unnecessary
allocations; move them to module-level constants (e.g., define const
FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const MIN_VALUE_FOR_LENGTH =
{...}) and replace the in-function usages of firstByteMasks and
minValueForLength with these hoisted constants in the functions that reference
them (search for identifiers firstByteMasks and minValueForLength to update all
occurrences).

In `@assets/js/erlang/unicode.mjs`:
- Around line 670-672: The comment "scan forward, validating each sequence" is
inconsistently cased compared to other occurrences; update this comment to "Scan
forward, validating each sequence" to match the other instances (or better,
extract the repeated logic into a helper function used by the blocks that
declare let pos = 0 to remove duplicated comments). Locate the block containing
the variable pos (the scanning loop) and either change the comment text to title
case or refactor the scan logic into a single function (e.g.,
scanForward/validateSequence) and replace the duplicated comments with a single
well-named function call.

In `@test/javascript/bitstring_test.mjs`:
- Around line 5267-5344: The two tests asserting truncation for 4-byte sequences
duplicate the same input ([0xf0, 0x90, 0x8d]) and the first test's comment is
inconsistent; remove the duplicate test (either the one titled "returns true for
truncated 4-byte sequence with valid continuation bytes" or "returns true for
4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single
test that uses Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d],
updating its comment to state "only 3 bytes available (2 valid continuation
bytes)" so the test name, comment and sample bytes are aligned.

@mward-sudo mward-sudo force-pushed the 02-18-extracts_truncated_utf-8_sequence_validation_to_bitstring_class branch from 6d289bb to 9ab7bc3 Compare February 18, 2026 23:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
assets/js/bitstring.mjs (1)

255-269: Object literals firstByteMasks and minValueForLength are re-allocated on every call.

Both decodeUtf8CodePoint (line 259) and isValidUtf8CodePoint (line 602) construct a new plain-object lookup table on each invocation. These tables are constant and could be hoisted to module-level or class-level const values to avoid repeated allocation, especially since these methods are called inside tight loops during UTF-8 scanning.

♻️ Proposed refactor (illustrative — apply equivalent changes to both methods)
+const FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07};
+const MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};

 static decodeUtf8CodePoint(bytes, start, length) {
   if (length === 1) return bytes[start];

-  const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-  let codePoint = bytes[start] & firstByteMasks[length];
+  let codePoint = bytes[start] & FIRST_BYTE_MASKS[length];
   ...
 }

 static isValidUtf8CodePoint(codePoint, encodingLength) {
-  const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-  if (codePoint < minValueForLength[encodingLength]) return false;
+  if (codePoint < MIN_VALUE_FOR_LENGTH[encodingLength]) return false;
   ...
 }

Also applies to: 596-612

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup
objects into shared constants to avoid re-allocation: move the firstByteMasks
and minValueForLength objects used in decodeUtf8CodePoint and
isValidUtf8CodePoint out of the functions and define them once at module (or
class static) scope as const FIRST_BYTE_MASKS and MIN_VALUE_FOR_LENGTH, then
replace the in-function object literals with references to those constants;
ensure you update both decodeUtf8CodePoint and isValidUtf8CodePoint to use the
new constant names (and keep behavior identical).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 638-657: The function isTruncatedUtf8Sequence currently reads
bytes[start] without bounds checking, causing out-of-range starts to be treated
as a valid truncated UTF-8; add an explicit bounds guard at the top of
isTruncatedUtf8Sequence to immediately return false when start is outside the
buffer (start < 0 or start >= bytes.length) before calling
$.getUtf8SequenceLength or $.isValidUtf8ContinuationByte, keeping the rest of
the logic intact.

---

Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The function findValidUtf8Length is duplicated across the
five Unicode helper blocks; extract a single shared implementation (keep the
current behavior that returns a numeric pos and uses
Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) into one
reusable helper (e.g., a top-level/exported findValidUtf8Length in the module)
and replace the other verbatim copies by calling that shared helper; ensure the
helper name and signature remain findValidUtf8Length(bytes) returning a number
so callers in the existing Unicode helper functions continue to work without
further changes.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-269: Hoist the per-call lookup objects into shared constants
to avoid re-allocation: move the firstByteMasks and minValueForLength objects
used in decodeUtf8CodePoint and isValidUtf8CodePoint out of the functions and
define them once at module (or class static) scope as const FIRST_BYTE_MASKS and
MIN_VALUE_FOR_LENGTH, then replace the in-function object literals with
references to those constants; ensure you update both decodeUtf8CodePoint and
isValidUtf8CodePoint to use the new constant names (and keep behavior
identical).

@mward-sudo
Copy link
Contributor Author

@coderabbitai Don't do any further reviews on this PR while it is in draft.

@coderabbitai
Copy link

coderabbitai bot commented Feb 19, 2026

@mward-sudo: Understood! I won't perform any further reviews on this PR while it remains in draft. The automatic reviews are now paused. Feel free to mention me again when you're ready for a review after the PR is out of draft status.

✅ Actions performed

Reviews paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract UTF-8 truncated sequence detection to Bitstring class

1 participant